Grounded Language Learning from Video Described with Sentences

نویسندگان

Haonan Yu

Jeffrey Mark Siskind

چکیده

We present a method that learns representations for word meanings from short video clips paired with sentences. Unlike prior work on learning language from symbolic input, our input consists of video of people interacting with multiple complex objects in outdoor environments. Unlike prior computer-vision approaches that learn from videos with verb labels or images with noun labels, our labels are sentences containing nouns, verbs, prepositions, adjectives, and adverbs. The correspondence between words and concepts in the video is learned in an unsupervised fashion, even when the video depicts simultaneous events described by multiple sentences or when different aspects of a single event are described with multiple sentences. The learned word meanings can be subsequently used to automatically generate description of new video.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Grounded spoken language acquisition: experiments in word learning

| Language is grounded in sensory-motor experience. Grounding connects concepts to the physical world enabling humans to acquire and use words and sentences in context. Currently most machines which process language are not grounded. Instead, semantic representations are abstract, pre-speci ed, and have meaning only when interpreted by humans. We are interested in developing computational syste...

متن کامل

A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video

We present an approach to simultaneously reasoning about a video clip and an entire natural-language sentence. The compositional nature of language is exploited to construct models which represent the meanings of entire sentences composed out of the meanings of the words in those sentences mediated by a grammar that encodes the predicate-argument relations. We demonstrate that these models fait...

متن کامل

Language, Music, and Brain

Introduction: Over the last centuries, scientists have been trying to figure out how the brain is learning the language. By 1980, the study of brain-language relationships was based on the study of human brain damage. But since 1980, neuroscience methods have greatly improved. There is controversy about where music, composition, or the perception of language and music are in the brain, or wheth...

متن کامل

Learning to Compose Spatial Relations with Grounded Neural Language Models

Language is compositional: we can generate and interpret novel sentences by having a notion of meaning of their individual parts. Spatial descriptions are grounded in perceptional representations but their meaning is also defined by what neighbouring words they co-occur with. In this paper we examine how language models conditioned on perceptual features can capture the semantics of composed ph...

متن کامل

Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework

Recently, joint video-language modeling has been attracting more and more attention. However, most existing approaches focus on exploring the language model upon on a fixed visual model. In this paper, we propose a unified framework that jointly models video and the corresponding text sentences. The framework consists of three parts: a compositional semantics language model, a deep video model ...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2013

Grounded Language Learning from Video Described with Sentences

نویسندگان

چکیده

منابع مشابه

Grounded spoken language acquisition: experiments in word learning

A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video

Language, Music, and Brain

Learning to Compose Spatial Relations with Grounded Neural Language Models

Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework

عنوان ژورنال:

اشتراک گذاری